Diacritization: A Challenge to Arabic Treebank Annotation and Parsing

نویسندگان

  • Mohamed Maamouri
  • Ann Bies
  • Seth Kulick
چکیده

Arabic diacritization (referred to sometimes as vocalization or vowelling), defined as the full or partial representation of short vowels, shadda (consonantal length or germination), tanween (nunation or definiteness), and hamza (the glottal stop and its support letters), is still largely understudied in the current NLP literature. In this paper, the lack of diacritics in standard Arabic texts is presented as a major challenge to most Arabic natural language processing tasks, including parsing. Recent studies (Messaoudi, et al. 2004; Vergyri & Kirchhoff 2004; Zitouni, et al. 2006 and Maamouri, et al. forthcoming) about the place and impact of diacritization in text-based NLP research are presented along with an analysis of the weight of the missing diacritics on Treebank morphological and syntactic analyses and the impact on parser development.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

تصحیح خودکار خطا در درخت بانک نحوی با استفاده از یادگیری ماشینی انتقال محور

The Treebank is one of the most useful resources for supervised or semi-supervised learning in many NLP tasks such as speech recognition, spoken language systems, parsing and machine translation. Treebank can be developded in different ways that could be, generally, categorized in manually and statistical approaches. While the resulted Treebank in each of these methods has the annotation error,...

متن کامل

Diacritization for Real-World Arabic Texts

For Arabic, diacritizing written text is important for many NLP tasks. In the work presented here, we investigate the quality of a diacritization approach, with a high success rate for treebank data but with a more limited success on realworld data. One of the problems we encountered is the non-standard use of the hamza diacritic, which leads to a decrease in diacritization accuracy. If an auto...

متن کامل

Enhanced Annotation and Parsing of the Arabic Treebank

The Arabic Treebank at the Linguistic Data Consortium has significantly revised and enhanced its annotation guidelines and annotation procedure over the past year. The revised syntactic guidelines are now being applied in annotation production, and the combination of the revised guidelines and a period of intensive annotator training has raised inter-annotator agreement f-measure scores already...

متن کامل

Creating a Methodology for Large-Scale Correction of Treebank Annotation: The Case of the Arabic Treebank

The LDC Arabic Treebank team has significantly revised and enhanced its annotation guidelines and annotation procedures over the last two years, with the goal of reducing inconsistency in annotation in the Treebank. We have now completed automatic and significant manual revisions to 738,845 tokens/words in total, bringing them into line as far as possible with the new annotation guidelines and ...

متن کامل

Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation

The Arabic Treebank (ATB), released by the Linguistic Data Consortium, contains multiple annotation files for each source file, due in part to the role of diacritic inclusion in the annotation process. The data is made available in both ”vocalized” and ”unvocalized” forms, with and without the diacritic marks, respectively. Much parsing work with the ATB has used the unvocalized form, on the ba...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006